{
 "metadata": {
  "name": ""
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<hr>\n",
      "# PYTHON FOR DATA SCIENCE\n",
      "### Exploring a Real World Example\n",
      "\n",
      "#Agenda\n",
      "\n",
      "- <p style=\"color: red\">Define the problem and the approach</p>\n",
      "- Data basics: loading data, looking at your data, basic commands\n",
      "- Handling missing values\n",
      "- Intro to scikit-learn\n",
      "- Grouping and aggregating data\n",
      "- Feature selection\n",
      "- Fitting and evaluating a model\n",
      "- Deploying your work\n"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##In this notebook you will\n",
      "\n",
      "- Determine if the problem is worth solving\n",
      "- Define an approach\n",
      "- Develop a workflow outline"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<hr>\n",
      "## Revisiting Kaggle's Give Me Some Credit Competition\n",
      "\n",
      "### Intro\n",
      "\n",
      "#### We'll be exploring an age-old prediction problem--predicting risk on consumer loans.\n",
      "\n",
      "#### We've selected this topic as a case study because the problem is well-defined and familiar to most people. Additionally, this is a binary classification problem, so the strategies we apply should be relatively extensible to other problems you may encounter totally unrelated to credit and finance.\n",
      "\n",
      "### About the data\n",
      "\n",
      "The data is made available to us by Kaggle and was used in a competition in 2011.\n",
      "\n",
      "[http://www.kaggle.com/c/GiveMeSomeCredit](http://www.kaggle.com/c/GiveMeSomeCredit)\n",
      "\n",
      "### About the prediction problem\n",
      "\n",
      "Predict the probability that somebody will experience financial distress in the next two years."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Developing an understanding of the data and the problem\n",
      "\n",
      "#### Key questions we'll need to keep in mind\n",
      "\n",
      "- How do losses occur?\n",
      "- What are the characteristics that constitute credit default?\n",
      "- How often do people \"go bad?\"\n",
      "- How might we improve loss rates?"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import pandas as pd\n",
      "df = pd.read_csv(\"./data/credit-training.csv\")\n",
      "df.head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>SeriousDlqin2yrs</th>\n",
        "      <th>RevolvingUtilizationOfUnsecuredLines</th>\n",
        "      <th>age</th>\n",
        "      <th>NumberOfTime30-59DaysPastDueNotWorse</th>\n",
        "      <th>DebtRatio</th>\n",
        "      <th>MonthlyIncome</th>\n",
        "      <th>NumberOfOpenCreditLinesAndLoans</th>\n",
        "      <th>NumberOfTimes90DaysLate</th>\n",
        "      <th>NumberRealEstateLoansOrLines</th>\n",
        "      <th>NumberOfTime60-89DaysPastDueNotWorse</th>\n",
        "      <th>NumberOfDependents</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>0</th>\n",
        "      <td> 1</td>\n",
        "      <td> 0.766127</td>\n",
        "      <td> 45</td>\n",
        "      <td> 2</td>\n",
        "      <td> 0.802982</td>\n",
        "      <td>  9120</td>\n",
        "      <td> 13</td>\n",
        "      <td> 0</td>\n",
        "      <td> 6</td>\n",
        "      <td> 0</td>\n",
        "      <td> 2</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1</th>\n",
        "      <td> 0</td>\n",
        "      <td> 0.957151</td>\n",
        "      <td> 40</td>\n",
        "      <td> 0</td>\n",
        "      <td> 0.121876</td>\n",
        "      <td>  2600</td>\n",
        "      <td>  4</td>\n",
        "      <td> 0</td>\n",
        "      <td> 0</td>\n",
        "      <td> 0</td>\n",
        "      <td> 1</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>2</th>\n",
        "      <td> 0</td>\n",
        "      <td> 0.658180</td>\n",
        "      <td> 38</td>\n",
        "      <td> 1</td>\n",
        "      <td> 0.085113</td>\n",
        "      <td>  3042</td>\n",
        "      <td>  2</td>\n",
        "      <td> 1</td>\n",
        "      <td> 0</td>\n",
        "      <td> 0</td>\n",
        "      <td> 0</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3</th>\n",
        "      <td> 0</td>\n",
        "      <td> 0.233810</td>\n",
        "      <td> 30</td>\n",
        "      <td> 0</td>\n",
        "      <td> 0.036050</td>\n",
        "      <td>  3300</td>\n",
        "      <td>  5</td>\n",
        "      <td> 0</td>\n",
        "      <td> 0</td>\n",
        "      <td> 0</td>\n",
        "      <td> 0</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>4</th>\n",
        "      <td> 0</td>\n",
        "      <td> 0.907239</td>\n",
        "      <td> 49</td>\n",
        "      <td> 1</td>\n",
        "      <td> 0.024926</td>\n",
        "      <td> 63588</td>\n",
        "      <td>  7</td>\n",
        "      <td> 0</td>\n",
        "      <td> 1</td>\n",
        "      <td> 0</td>\n",
        "      <td> 0</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 4,
       "text": [
        "   SeriousDlqin2yrs  RevolvingUtilizationOfUnsecuredLines  age  \\\n",
        "0                 1                              0.766127   45   \n",
        "1                 0                              0.957151   40   \n",
        "2                 0                              0.658180   38   \n",
        "3                 0                              0.233810   30   \n",
        "4                 0                              0.907239   49   \n",
        "\n",
        "   NumberOfTime30-59DaysPastDueNotWorse  DebtRatio  MonthlyIncome  \\\n",
        "0                                     2   0.802982           9120   \n",
        "1                                     0   0.121876           2600   \n",
        "2                                     1   0.085113           3042   \n",
        "3                                     0   0.036050           3300   \n",
        "4                                     1   0.024926          63588   \n",
        "\n",
        "   NumberOfOpenCreditLinesAndLoans  NumberOfTimes90DaysLate  \\\n",
        "0                               13                        0   \n",
        "1                                4                        0   \n",
        "2                                2                        1   \n",
        "3                                5                        0   \n",
        "4                                7                        0   \n",
        "\n",
        "   NumberRealEstateLoansOrLines  NumberOfTime60-89DaysPastDueNotWorse  \\\n",
        "0                             6                                     0   \n",
        "1                             0                                     0   \n",
        "2                             0                                     0   \n",
        "3                             0                                     0   \n",
        "4                             1                                     0   \n",
        "\n",
        "   NumberOfDependents  \n",
        "0                   2  \n",
        "1                   1  \n",
        "2                   0  \n",
        "3                   0  \n",
        "4                   0  "
       ]
      }
     ],
     "prompt_number": 4
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df.SeriousDlqin2yrs.mean()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 5,
       "text": [
        "0.066839999999999997"
       ]
      }
     ],
     "prompt_number": 5
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##Building a Model\n",
      "Since we have a fairly large volume of data (150K), we're going to build a predictive model that will enable us to automatically score each applicant. We will provide each applicant with a credit score. This will give us an easy to interpret, human readable form of the model."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##Our Strategy\n",
      "If we're building a model, we're going to need a way to know whether or not it's working. Convincing other is oftentimes the most challenging parts of an analysis. Making repeatable, well documented work with clear success metrics makes all the difference.\n",
      "\n",
      "For our classifier, we're going to use the following build methodology:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from IPython.core.display import Image\n",
      "Image(url=\"https://s3.amazonaws.com/demo-datasets/traintest.png\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<img src=\"https://s3.amazonaws.com/demo-datasets/traintest.png\"/>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 6,
       "text": [
        "<IPython.core.display.Image at 0x110327410>"
       ]
      }
     ],
     "prompt_number": 6
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##Other Things to consider\n",
      "\n",
      "- [Precision and Recall](http://en.wikipedia.org/wiki/Precision_and_recall) - What good is our classifier if declines everyone?\n",
      "- [Overfitting](http://en.wikipedia.org/wiki/Overfitting) - Is your model describing noise or signal?\n",
      "- [Algorithms](http://en.wikipedia.org/wiki/Statistical_classification#Algorithms) - What type of classifiers might work in this scenario?"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##We just did the following\n",
      "\n",
      "- Determined we had a problem in which there was ample room for improvement\n",
      "- Decided we could use a predictive model to decrease losses\n",
      "- Proposed a high level workflow"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}